Code
library(tidyverse)
library(ggridges)
library(RColorBrewer)
library(broom)
library(gganimate)
library(gifski)Investigating the Relationship Between Income and Sanitation
library(tidyverse)
library(ggridges)
library(RColorBrewer)
library(broom)
library(gganimate)
library(gifski)sanitation.data <- read_csv("at_least_basic_sanitation_overall_access_percent.csv")
income.data <- read_csv("income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
colnames(income.data) <- c("country", as.numeric(1799:2049))head(select(income.data, 1:5))# A tibble: 6 × 5
country `1799` `1800` `1801` `1802`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 683 683 683 683
2 Angola 700 702 705 709
3 Albania 755 755 755 755
4 Andorra 1360 1360 1360 1360
5 United Arab Emirates 1130 1130 1140 1140
6 Argentina 1730 1730 1740 1740
The income data set measures the total GDP of a country per person for 216 countries between the years of 1892 and 2049, with units in fixed 2017 prices. This number is adjusted for PPP, or the differences between costs of living and essential products, across countries. The data comes from the World Bank, the Maddison Project Database, and the Penn World Table. Historical estimates were used for early years; forecasts from the IMF’s Economic Outlook were used to project income in the future.
head(select(sanitation.data, 1:5))# A tibble: 6 × 5
country `1999` `2000` `2001` `2002`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Aruba 98.3 98.3 98.2 98.1
2 Afghanistan 21.9 21.9 23.3 24.7
3 Angola 27.6 29 30.4 31.9
4 Albania 89.5 90 90.6 91.2
5 Andorra 100 100 100 100
6 United Arab Emirates NA NA 97.4 97.4
The sanitation data set measures the percentage of people (living in both urban and rural settings) who use at lease basic sanitation services not shared with other households. This includes:
The data was collected by the World Health Organization and UNICEF, falling between the years of 1999 to 2019.
We hypothesize that there will be a positive relation between income per person and percentage of people with basic sanitation at their disposal. We would also like to investigate and highlight any potential outliers, such as moments in time where certain countries failed to improve sanitation with increased income, or drastic dips/surges in income/sanitation over a single year.
convert_num <- function(x){ #converts "12.3k" to 12300,
temp <- x
if(str_detect(x, "k$")){
temp <- x |>
str_replace_all("[^[:digit:]]", "") |> #gross but it just replaces any non-number with ""
as.numeric() * 100
}
return(temp)
}
#data cleaning + joining
sanitation.clean <- sanitation.data |>
drop_na() |>
pivot_longer(cols = `1999`:`2019`,
names_to = "year",
values_to = "sanitation")
income.clean <- income.data |>
drop_na() |> #pretty bold step to remove any rows with na vals,
#but due to the abundance of data seems reasonable
select(c(country, `1999`:`2019`)) |>
pivot_longer(cols = `1999`:`2019`,
names_to = "year",
values_to = "income") |>
mutate(income = as.numeric(map(income, convert_num)))#converts characters,
#able to convert after pivot, b/c data is all seen as strings!The cleaning process above is relatively simple, converting some data entries of the form \(12.3k \to 12300\). Notice that the decision to drop NA values was made, since ggplot and lm will drop NA values and we have an abundance of data entries. This decision will save headaches later down the road when taking advantage of the country column.
full.data <- inner_join(sanitation.clean, income.clean)
head(full.data)# A tibble: 6 × 4
country year sanitation income
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1999 21.9 584
2 Afghanistan 2000 21.9 569
3 Afghanistan 2001 23.3 1190
4 Afghanistan 2002 24.7 1240
5 Afghanistan 2003 26.1 1200
6 Afghanistan 2004 27.5 1290
We aim to piece together the relationship between income and sanitation levels.
Our goal is to develop a model that predicts the percent of people with basic sanitation (response) from adjusted income per person (explanatory).
We first visualize the relation between these two variables:
full.plot <- full.data |>
ggplot(mapping = aes(x = income, y = sanitation) ) +
geom_point() +
geom_smooth(method = lm, level = 0, color = "red") +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,100,25)) +
labs(title = "Income versus Sanitation, 1999-2049",
subtitle = "Percentage of People with (at least) Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "")
full.plottemp.plot <- full.data |>
mutate(year = as.numeric(year)) |>
ggplot(mapping = aes(x = income,
y = sanitation)) +
geom_point(show.legend = F) +
transition_time(year) +
labs(title = "Income versus Sanitation, 1999-2019",
subtitle = "Year: {floor(frame_time)}",
x = "Adjusted Income per Person (in 2017 $)",
y = "Percentage of People with (at least) Basic Sanitation Services")
temp.plot#we can make it so that the animation only plasy once, just didnt have timeAlthough the above scatter plot is very messy, we can see a general pattern: countries with low adjusted income per person have lower percentages of people with basic sanitation, whereas countries with high income per person never have less than 90 percent of people with basic sanitation.
However, there are instances where countries with very low income per person have high percentages of people with basic sanitation. Below you can see that the United States has had two years between 1999 and 2019 where adjusted income per person dipped below $20,000, but over 99% of Americans were able to have access to basic sanitation.
full.data |>
mutate(year = as.numeric(year)) |>
filter(country == "United States") |>
ggplot(mapping = aes(x = income, y = sanitation)) +
geom_point() +
transition_time(year) +
labs(title = "United States Income vs. Sanitation, 1999-2019",
subtitle = "Year: {floor(frame_time)}",
x = "Adjusted Income Per Person (in 2017 $)",
y = "Percentage of People with (at least) Basic Santitation") +
shadow_mark(alpha = 1,
size = 1)Despite an obvious general trend, further investigation is needed to gain insight on descrepancies like the ones above.
We look at the relationship between average income per person and average percentage of people with basic sanitation at their disposal across 1999-2019.
# take average income and sanitation for each year, then plot?
# cleveland dot plots to better indicate year
# 1999 is lowest, 2019 is highest, everything else already in order.
full.data |>
group_by(year) |>
summarise(mean_income = mean(income),
mean_sanitation = mean(sanitation)) |>
ggplot(mapping = aes(x = mean_income,
y = mean_sanitation#,
#color = fct_reorder2(year, mean_income, mean_sanitation)
)
) +
geom_segment(mapping = aes(xend = 0, yend = mean_sanitation)) +
geom_point() +
annotate("text", x = 17000, y = 76.6, label = "2019") +
annotate("text", x = 18700, y = 76.1, label = "2018") +
annotate("text", x = 17600, y = 75.6, label = "2017") +
annotate("text", x = 19200, y = 75.2, label = "2016") +
annotate("text", x = 19200, y = 74.6, label = "2015") +
annotate("text", x = 17500, y = 74.1, label = "2014") +
annotate("text", x = 17000, y = 73.5, label = "2013") +
annotate("text", x = 17700, y = 73.0, label = "2012") +
annotate("text", x = 16900, y = 72.4, label = "2011") +
annotate("text", x = 17500, y = 71.9, label = "2010") +
annotate("text", x = 16100, y = 71.3, label = "2009") +
annotate("text", x = 16600, y = 70.7, label = "2008") +
annotate("text", x = 17200, y = 70.1, label = "2007") +
annotate("text", x = 17200, y = 69.5, label = "2006") +
annotate("text", x = 17200, y = 69, label = "2005") +
annotate("text", x = 15900, y = 68.5, label = "2004") +
annotate("text", x = 15200, y = 67.9, label = "2003") +
annotate("text", x = 14700, y = 67.4, label = "2002") +
annotate("text", x = 15000, y = 66.9, label = "2001") +
annotate("text", x = 14400, y = 66.4, label = "2000") +
annotate("text", x = 14600, y = 65.8, label = "1999") +
labs(title = "Average Income versus Average Sanitation, 1999-2019",
subtitle = "Average Percentage of People with Basic Sanitation",
x = "Average Income per Person (in 2017 $)",
y = "",
color = "Year")Plotting the aggregate average income per person and average sanitation percentage across all countries, the trend seems to follow a increasing trend. Again, further investigation (and possibly historical analysis) is required in otder to decipher the dips in cleanliness around time frames such as the late aughts and the early 2010’s.
Upon fitting a simple linear regression model for percent of people who have access to basic sanitation (\(Y\)) based on adjusted income per person (\(X\)), we acquired the following regression equation: \[\hat{Y} = 55.75 + 0.0009699X\]
Our regression equation tells us that a country with an average adjusted income of $0 will have 55.75% of its people with access to basic sanitation services. Further, for every ten thousand dollars increase in adjusted income per person, a country should expect a 9.699 percent increase in the amount of people who have access to basic sanitation.
To assess the validity of our linear model, we look at the variances in observed sanitation percentage, predicted sanitation percentage, and residuals:
model_var <- augment(data.fit) |>
summarise(var.resp = var(sanitation),
var.fitted = var(.fitted),
var.resid =var(.resid))
model_var# A tibble: 1 × 3
var.resp var.fitted var.resid
<dbl> <dbl> <dbl>
1 943. 314. 629.
The proportion of variability accounted for by our model was 0.333. Our model does a good enough job to communicate the general idea that more money means people are going to be cleaner. However, the variable we are hoping to predict is a percentage which means it can not exceed 100%, exposing the flaw that our model will predict impossible to reach sanitation percentages for high enough average income inputs. This gives us problems regarding what we can and can not conclude due to risk of extrapolating further than our data provides. There is not much we can do about this using a simple linear model as we would have to introduce a polynomial regression equation (say, through a logarithmic transformation) to be more precise.